library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(wordcloud2)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-8
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(ROCR)

Group members

Yuxuan Du: yd2739

Tara Zhan: fz2377

Yaduo Wang: yw4199

Chenshuo Pan: cp3384

Wenxin Tian: wt2369

Project motivation

Amidst the burgeoning gaming industry, popular belief holds that genres such as shooters, sports, and racing are perennial favorites. As gamers, our team suspects there is a “success formula” behind successful video games, and our avid gaming experience on STEAM has piqued our interest in this hypothesis. Our objective is to test the veracity of such claims and identify characteristics that signal the popularity of STEAM games.

Anticipated data sources

https://www.kaggle.com/datasets/fronkongames/steam-games-dataset

Data import and cleaning

In the process of preparing our data for analysis, we have established a set of inclusion and variable selection criteria to ensure the relevance of our study. These criteria are vital to the successful execution of our data science investigation.

One of the primary inclusion criteria for our analysis is the exclusion of free games from our dataset. This decision is underpinned by the understanding that in free-to-play games, in-app purchases often serve as viable alternatives to traditional game sales. However, due to the unavailability of comprehensive data pertaining to these in-app purchases, their inclusion in our analysis would be both impractical and imprecise.

In addition, we have also opted to remove certain variables from our dataset that are non-essential for our analysis.

We eliminate columns containing URLs as they do not contribute meaningful information to our analytical objectives.

Variables associated with game developers and publishers will be filtered out for two primary reasons. First, we posit that predicting or quantifying the effect of developer names may be challenging, if not infeasible. Second, we hypothesize that brand effects can be captured more accurately through a combination of game features and keywords; for example, there is a tag for independent games, which is the complement of games made by large companies.

Furthermore, we will exclude the variables “Playtime in Current Two Weeks” as it may not be representative for older games, and “Average Playtime Forever” in favor of “Median Playtime Forever” due to potential distortions caused by AFK (Away From Keyboard) gamers.

It is worth noting that we are cognizant of the need to address potential data leakage issues and redundant features when fitting our model. These aspects will be thoroughly addressed in subsequent stages of our analysis to ensure the robustness of our findings.

df = read.csv("DATA/games.csv")|>
  janitor::clean_names()|>
  subset(select = -c(dlc_count, about_the_game, reviews, header_image, website, support_url, support_email, metacritic_score, metacritic_url, notes, developers, publishers, screenshots, movies, score_rank, average_playtime_two_weeks, median_playtime_two_weeks, average_playtime_forever, peak_ccu, full_audio_languages))|>
  subset(price > 0)
#Change the format of release date. 
df = df|>
  mutate(release_date = as.Date(release_date, format = "%b %d, %Y"))
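As a quick sanity check on the date format string, `"%b %d, %Y"` matches Steam-style dates such as `"Oct 21, 2008"` (note that `%b` is locale-dependent, so an English locale is assumed here):

```r
# "%b %d, %Y": abbreviated month name, day, 4-digit year
# (assumes an English locale for the %b month abbreviation)
as.Date("Oct 21, 2008", format = "%b %d, %Y")
## [1] "2008-10-21"
```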

EDA

Price distribution of games with different estimated owners

# we filter out Ascent Free-Roaming VR Experience since its price is 1000 dollars, which would make it hard to see any trend in our plot. 
df|>
  subset(name != "Ascent Free-Roaming VR Experience")|>
  separate_wider_delim(cols = estimated_owners, delim = " - ", names = c("estimate_low", "estimate_high"))|>
  mutate(estimate_low = as.numeric(estimate_low))|>
  mutate(estimate_high = as.numeric(estimate_high))|>
  mutate(estimate_mean = (estimate_high + estimate_low)/2)|>
  mutate(estimate_mean = factor(estimate_mean))|>
  ggplot(aes(x = estimate_mean, y = price))+
  geom_boxplot()+
  coord_flip()

From the price distribution plot of games categorized by estimated owner counts, a notable pattern emerges: games with more estimated owners tend to carry relatively higher prices. Intriguingly, however, once a certain threshold of estimated owners is reached, prices begin to decline.

Overall frequency of keywords (genre, categories, tags)

Here, we explore the frequency of keywords across genres, categories, and tags.

categories_freq = df|>
  separate_rows(categories, sep = ",")|>
  group_by(categories)|>
  summarise(n_obs = n())|>
  arrange(desc(n_obs))|>
  ggplot(aes(x = reorder(categories, +n_obs), y = n_obs))+
  geom_bar(stat="identity")+
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1)) +
  coord_flip()

categories_freq

From the bar plot, it is evident that single-player games have the highest frequency in our dataset. This is reasonable: multi-player games require more optimization, so single-player games are the natural choice for independent game makers. VR-related categories appear infrequently, likely because the equipment is expensive and the technology is relatively new.

genre_freq = df|>
  separate_rows(genres, sep = ",")|>
  group_by(genres)|>
  summarise(n_obs = n())|>
  arrange(desc(n_obs))|>
  ggplot(aes(x = reorder(genres, +n_obs), y = n_obs))+
  geom_bar(stat="identity")+
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1)) +
  coord_flip()

genre_freq

In the genre bar plot, aside from the most frequent keyword, Indie, the remaining keywords are the ones we commonly associate with games: Sports, Racing, Action, and Adventure cover most game genres.

tags_freq = df|>
  separate_rows(tags, sep = ",")|>
  group_by(tags)|>
  summarise(n_obs = n())|>
  subset(n_obs >= 5000)|>
  arrange(desc(n_obs))|>
  ggplot(aes(x = reorder(tags, +n_obs), y = n_obs))+
  geom_bar(stat="identity")+
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1)) +
  coord_flip()

tags_freq

As there are too many tags on STEAM, we only show the most frequent ones. Interestingly, we found more puzzle games than RPGs, possibly because puzzle games cost less to develop.

Trend of keyword frequency across years

It might be informative to visualize how the mix of game genres changed across release years, since a game like WoW defined the 2000s while games like Overwatch are a better fit for the 2020s.

genre_freq_year = df|>
  mutate(year = year(release_date))|>
  separate_rows(genres, sep = ",")|>
  group_by(year, genres)|>
  summarise(n_obs = n())|>
  group_by(year)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
genre_freq_year_total = df|>
  mutate(year = year(release_date))|>
  separate_rows(genres, sep = ",")|>
  group_by(year)|>
  summarise(n_obs_total = n())
  

genre_freq_year_total|>
  knitr::kable()
| year| n_obs_total|
|----:|-----------:|
| 1997|           3|
| 1998|           1|
| 1999|           2|
| 2000|           2|
| 2001|           7|
| 2002|           2|
| 2003|           7|
| 2004|           7|
| 2005|          10|
| 2006|          79|
| 2007|         118|
| 2008|         230|
| 2009|         477|
| 2010|         419|
| 2011|         477|
| 2012|         659|
| 2013|        1011|
| 2014|        3312|
| 2015|        6109|
| 2016|        9853|
| 2017|       14823|
| 2018|       19051|
| 2019|       18579|
| 2020|       22777|
| 2021|       27694|
| 2022|       30735|
| 2023|       19446|
|   NA|         277|

We can see from the table that there are very few games on STEAM before 2006; genre frequencies in those years could be strongly affected by randomness, so we only plot the trend after 2005.

#we limit the plot to years after 2005: as the table above shows, very few games were released before 2006, so those years might not be representative. 
left_join(genre_freq_year, genre_freq_year_total, by = "year")|>
  subset(year > 2005)|>
  mutate(genre_ratio = n_obs/n_obs_total)|>
  ungroup()|>
  plot_ly(x = ~year, y = ~genre_ratio, color = ~genres, type = "scatter", mode = "lines+markers",  colors = "viridis")
popular_genres_bar_plot = left_join(genre_freq_year, genre_freq_year_total, by = "year")|>
  subset(year > 2005)|>
  mutate(genre_ratio = n_obs/n_obs_total)|>
  ggplot(aes(x = year, y = genre_ratio, fill = genres)) + 
  geom_col(position = 'stack', width = 0.6)+
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1))
popular_genres_bar_plot

We found some interesting trend from the line plot:

Rise of Indie Games: A prominent observation is the surge in the frequency of the “Indie” genre, particularly after 2012. This is likely attributable to broader access to personal computers, which has made it easier for independent developers to create and publish their games.

Shift in Genre Popularity: The line plot also highlights a noteworthy trend in the Action genre. While it remains dominant, it has experienced a gradual decline in frequency over the years. This shift may be influenced by changing player preferences and the emergence of diverse gaming experiences in other genres. It suggests that players are increasingly seeking a wider variety of gaming experiences beyond traditional action games.

Casual Games on the Rise: The period from 2012 to 2023 shows an increase in the frequency of “Casual” games. This may be indicative of a growing market for games that offer accessible and relaxed gameplay experiences, and might also indicate that players are craving relaxation rather than stimulation.

Early Access Strategy: The “Early Access” games on Steam experienced a notable increase in their frequency. This approach provides independent game developers with the opportunity to release their games in a partially completed state, allowing them to gather player feedback, iterate on their games, and secure investment more easily. The popularity of this strategy underscores its effectiveness in the independent game development ecosystem.

Word Cloud visualization

Word clouds help visualize the frequency of keywords in different groups. Here, we create separate word clouds for popular and unpopular games.

overwhelm_popular_wordcloud = df|>
  subset((positive+negative) > 500)|>
  subset(positive/(positive+negative) > 0.95)|>
  separate_rows(tags, sep = ",")|>
  group_by(tags)|>
  summarise(n_obs = n())|>
  wordcloud2()

# df_popular and df_unpopular are assumed to be subsets of df built with the
# popularity criteria described in the model fitting section below
popular_tags_wordcloud = df_popular|>
  separate_rows(tags, sep = ",")|>
  group_by(tags)|>
  summarise(n_obs = n())|>
  wordcloud2()

unpopular_tags_wordcloud = df_unpopular|>
  separate_rows(tags, sep = ",")|>
  group_by(tags)|>
  summarise(n_obs = n())|>
  wordcloud2()
overwhelm_popular_wordcloud
popular_tags_wordcloud
unpopular_tags_wordcloud

The distinctions revealed among the three word clouds are insightful and provide valuable context for understanding the characteristics that differentiate popular and unpopular games:

Prevalence of Adventure Games in Unpopular Games: Adventure games appear to be more prevalent among unpopular games and are less common in popular games. This discrepancy may stem from the accessibility of game development tools like RPG Maker, which enables the creation of adventure games with relative ease. Consequently, the market may be flooded with a range of adventure games of varying quality, leading to their uneven representation among unpopular titles.

Emphasis on Difficulty in Popular Games: The word cloud for popular games prominently features the tag “difficult.” This aligns with the reality that some of the most highly regarded games in the industry, such as Elden Ring and Sekiro: Shadows Die Twice, are renowned for their challenging gameplay. The presence of this tag underscores the appeal of a high level of difficulty in certain popular titles.

Ratio of 2D to 3D Games: Notably, the word cloud suggests that the ratio of 2D games to 3D games is higher in popular games compared to unpopular ones. This observation may be attributed to resource constraints faced by smaller game development studios. In many cases, these studios may prioritize the development of engaging plotlines and intricate game mechanics over the creation of elaborate 3D environments, resulting in a higher prevalence of 2D games among popular titles.

Popularity of Pixel Graphics: The higher frequency of the “pixel graphics” tag in popular games is an intriguing finding. Pixel graphics are favored by independent developers as they allow a strong focus on game mechanics. This aligns with the success stories of games like Terraria and Minecraft, which have leveraged pixel graphics to create immersive and engaging gameplay experiences. The appeal of pixel graphics in popular games reflects the importance of gameplay and mechanics in driving their popularity.

Model fitting

Preventing data leakage

Since it is hard for models to analyze keywords as raw strings, we one-hot encoded categories, genres, and tags, which makes the dataset easier for models to consume.

During our data preprocessing phase, we identified certain tags that had the potential to introduce data leakage into our dataset. These tags, listed below, were deemed inappropriate for inclusion due to their subjective and conclusive nature, making them difficult for a game producer to ascertain accurately: Masterpiece, Great Soundtrack, Addictive, benchmark, classic. By excluding these tags from our dataset, we can provide a model that helps game makers gauge a game’s potential popularity.
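As a minimal sketch of the one-hot encoding step on toy data (the app_ids and keywords here are hypothetical), the separate_rows plus pivot_wider pattern turns a comma-separated keyword string into 0/1 indicator columns:

```r
library(dplyr)
library(tidyr)

# Toy data with hypothetical app_ids and comma-separated keywords
toy = tibble::tibble(
  app_id   = c(1, 2),
  keywords = c("Indie,Action", "Indie,Casual")
)

toy_encoded = toy |>
  separate_rows(keywords, sep = ",") |>   # one row per (app_id, keyword)
  mutate(value = 1) |>                    # presence indicator
  pivot_wider(names_from = keywords,
              values_from = value,
              values_fill = 0)            # absent keywords become 0

toy_encoded
## columns app_id, Indie, Action, Casual; game 1 gets (1, 1, 0), game 2 gets (1, 0, 1)
```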

df_concat = df|>
  mutate(keywords = paste(categories, genres, tags, sep = ","))|>
  subset(select = -c(categories, genres, tags))
keywords_df = df_concat|>
  subset(select = c(app_id, keywords))|>
  separate_rows(keywords, sep = ",")|>
  distinct(app_id, keywords, .keep_all = TRUE)|>
  mutate(value = 1)|>
  subset(keywords != "")|>
  pivot_wider(names_from = keywords, values_from = value, values_fill = 0)

one_hot_encoded_df = left_join(df_concat, keywords_df, by = "app_id")

popular_encoded_df = one_hot_encoded_df|>
  subset((positive+negative) > 10)|>
  subset(positive/(positive+negative) > 0.7)|>
  subset(estimated_owners != "0 - 20000")|>
  subset(median_playtime_forever > 120)|>
  mutate(popular = 1)

unpopular_encoded_df = anti_join(one_hot_encoded_df, popular_encoded_df, by="app_id")|>
  mutate(popular = 0)

encoded_with_label_df = rbind(popular_encoded_df, unpopular_encoded_df)|>
  janitor::clean_names()|>
  subset(select = -c(positive, negative, median_playtime_forever, estimated_owners, recommendations, user_score,  supported_languages, keywords, addictive, masterpiece, great_soundtrack, benchmark, classic))|>
  mutate(windows = as.integer(as.logical(windows)))|>
  mutate(mac = as.integer(as.logical(mac)))|>
  mutate(linux = as.integer(as.logical(linux)))

dataset_without_id_name = encoded_with_label_df|>
  subset(select = -c(app_id, name, release_date))|>
  drop_na()

Feature selection

There are some redundant features in our dataset; we want to identify them and remove some of them.

#this function helps us find tags that are potential subsets of other tags: take the dot product of two one-hot encoded columns and compare it with each column's sum; if they are equal and nonzero, one tag only ever appears together with the other, so the two tags could be related. 
find_col_pairs = function(df) {
  n = ncol(df)
  result = character(0)
  
  for (i in 1:(n-1)) {
    for (j in (i+1):n) {
      col_i = df[, i]
      col_j = df[, j]
      dot_product = sum(col_i * col_j)
      col_i_sum = sum(col_i)
      col_j_sum = sum(col_j)
      if (col_i_sum == dot_product & dot_product != 0 & col_i_sum != 0) {
        result = c(result, paste(colnames(df)[i], colnames(df)[j], sep = "-"))
      }
      if (col_j_sum == dot_product & dot_product != 0 & col_j_sum != 0) {
        result = c(result, paste(colnames(df)[j], colnames(df)[i], sep = "-"))
      }
    }
  }
  
  return(result)
}

subgroup_features = find_col_pairs(dataset_without_id_name)
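The subset check inside find_col_pairs reduces to one comparison: if column b is nonzero only where column a is, then the dot product of the two 0/1 columns equals the sum of b. A minimal self-contained illustration:

```r
# b = 1 only where a = 1, so sum(a * b) == sum(b) flags "b-a" as a
# potential subset pair; c_col overlaps a only partially, so it is not flagged.
a     = c(1, 1, 0)
b     = c(1, 0, 0)
c_col = c(0, 1, 1)

sum(a * b) == sum(b)          # TRUE: b is a subset of a
## [1] TRUE
sum(a * c_col) == sum(c_col)  # FALSE: c_col is not a subset of a
## [1] FALSE
```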

From the result, most features are sub-groups of windows, which is expected since most games run on Windows. We decided to remove this feature since it is not informative for our model. We also found that the tags batman, fox, birds, football_american, tile_matching, and tracked_motion_controller_support are each present only once, and coding appears only twice; to make our model more generalizable, we decided to drop these columns.

We identified redundant features such as ‘Multi_player’ and ‘multiplayer,’ as well as ‘Single_player’ and ‘singleplayer.’ Although they are not completely collinear, it was evident that they provided similar information. To enhance model interpretability, we decided to remove one of each redundant pair.

dataset_without_id_name = dataset_without_id_name|>
  subset(select = -c(windows, multiplayer, birds, football_american, fox, batman, coding, tile_matching,tracked_motion_controller_support, singleplayer))

Lasso regression

Here, we choose lasso regression since it can perform feature selection. We use 5-fold cross-validation to select lambda, and we first split the data into train and test sets to prevent possible data leakage.

#first, we split target and tags, then we split train and test datasets
dataset_without_id_name= dataset_without_id_name[sample(1:nrow(dataset_without_id_name)), ] 
dataset_without_id_name$id = 1:nrow(dataset_without_id_name)
train = dataset_without_id_name |>
  sample_frac(0.70)
test = anti_join(dataset_without_id_name, train, by = 'id')

train = train|>
  subset(select = -c(id))

test = test|>
  subset(select = -c(id))

train_x = train|>
  subset(select = -c(popular))
train_y = train|>
  pull(popular)

test_x = test|>
  subset(select = -c(popular))

test_y = test|>
  pull(popular)

set.seed(1234)
foldid = sample(1:5, size = nrow(train_x), replace = TRUE)

lambda = 10^(seq(2, -5, -0.1))

lasso_fit = glmnet(
  x = as.matrix(train_x), 
  y = train_y, 
  lambda = lambda,
  alpha=1, 
  family = "binomial"
  )

lasso_cv = cv.glmnet(
  x = as.matrix(train_x), 
  y = train_y, 
  lambda = lambda, 
  foldid = foldid,
  alpha=1, 
  family = "binomial"
)

lambda_opt = lasso_cv$lambda.min
broom::tidy(lasso_fit) |>
  select(term, lambda, estimate) |>
  complete(term, lambda, fill = list(estimate = 0) ) |>
  filter(term != "(Intercept)") |>
  ggplot(aes(x = log(lambda, 10), y = estimate, group = term, color = term)) + 
  geom_path() + 
  geom_vline(xintercept = log(lambda_opt, 10), color = "blue", linewidth = 1.2) +
  theme(legend.position = "none")

This plot shows the lasso coefficient paths across lambda; the vertical blue line marks the lambda selected by cross-validation.

result = predict(lasso_fit, s = lambda_opt, newx = as.matrix(test_x), type = 'response')
prediction_obj = prediction(result, test_y)
performance_obj = performance(prediction_obj, "tpr", "fpr")
auc_value = performance(prediction_obj, "auc")@y.values[[1]]
plot(performance_obj, main = "ROC Curve")
text(0.6, 0.2, paste("AUC =", round(auc_value, 2)), col = "red")

result = as.data.frame(result)|>
  mutate(prediction = ifelse(s1 > 0.5, 1, 0))|>
  mutate(actual = test_y)|>
  mutate(difference = ifelse(prediction != actual, 1, 0))
acc = (nrow(result) - sum(pull(result, difference)))/nrow(result)

result = result|>
  mutate(actual = factor(actual, levels = c(1, 0)))|>
  mutate(prediction = factor(prediction, levels = c(1, 0)))
confusionMatrix(data=pull(result, prediction), reference = pull(result, actual))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     1     0
##          1   576   308
##          0  1126 16979
##                                           
##                Accuracy : 0.9245          
##                  95% CI : (0.9206, 0.9282)
##     No Information Rate : 0.9104          
##     P-Value [Acc > NIR] : 1.648e-12       
##                                           
##                   Kappa : 0.4093          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.33843         
##             Specificity : 0.98218         
##          Pos Pred Value : 0.65158         
##          Neg Pred Value : 0.93781         
##              Prevalence : 0.08963         
##          Detection Rate : 0.03033         
##    Detection Prevalence : 0.04655         
##       Balanced Accuracy : 0.66030         
##                                           
##        'Positive' Class : 1               
## 

Upon examining the confusion matrix, it becomes evident that our model struggles to correctly classify a substantial number of popular games. However, a noteworthy observation is that the rate of false positives is within an acceptable range. This observation suggests that there may be discernible patterns among unpopular games that our model can capture effectively. As a result, this model holds promise in providing valuable insights for identifying games that are less likely to achieve popularity.

Furthermore, our analysis reveals that the model’s performance, as indicated by the p-value, is statistically significant compared to the scenario of having no information. This suggests that our model does contain valuable information that contributes to its predictive capabilities. The significance of the p-value underscores the utility of the model in making informed predictions and decisions.
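For reference, the “P-Value [Acc > NIR]” reported by caret is a one-sided binomial test of the observed accuracy against the no-information rate; with the counts from the confusion matrix above, it can be reproduced approximately as:

```r
# Correct predictions = 576 + 16979; total = 576 + 308 + 1126 + 16979
# p is the (rounded) no-information rate from the caret output above
binom.test(x = 576 + 16979,
           n = 576 + 308 + 1126 + 16979,
           p = 0.9104,
           alternative = "greater")$p.value
## well below 0.05, consistent with the reported 1.648e-12
```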

It is important to note that the model’s sensitivity, particularly in recognizing popular games, is relatively low. One plausible explanation is the imbalanced ratio of popular to unpopular games in the dataset: the skewed distribution makes it hard for the model to learn generalized patterns for popular games, reducing sensitivity. Down-sampling the majority class might help, but it is unclear whether such down-sampling would cause information loss.
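A simple down-sampling pass could be sketched as below on toy labels (this is an illustration with made-up data, not part of our fitted pipeline): each class is sampled down to the minority class size before refitting.

```r
library(dplyr)

set.seed(1)
# Toy labels mimicking class imbalance: 3 popular (1) vs 9 unpopular (0)
toy = tibble::tibble(popular = c(rep(1, 3), rep(0, 9)))

toy_balanced = toy |>
  group_by(popular) |>
  slice_sample(n = sum(toy$popular == 1)) |>  # sample each class down to minority size
  ungroup()

table(toy_balanced$popular)
## both classes are now represented by 3 rows each
```

Whether this helps in practice depends on how much signal the discarded unpopular games carry.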

Most important tags and their effects

First, we extract the coefficients (betas) from our model.

betas = coef(lasso_cv, s = lambda_opt)

beta_summary =  summary(betas)

beta_df = data.frame(tags = rownames(betas)[beta_summary$i],
           Weight = beta_summary$x)|>
  mutate(abs_weight = abs(Weight))|>
  arrange(desc(abs_weight))
beta_df|>
  subset(tags != "(Intercept)")|>
  head(30)|>
  ggplot(aes(x = reorder(tags, +abs_weight), y = Weight))+
  geom_bar(stat="identity")+
  coord_flip()

From the coefficient bar plot, we can see some important tags that contribute to popular games. This fits our experience: social deduction games like Goose Goose Duck are among the most popular games currently, and Dungeons and Dragons and epic are the tags for Divinity: Original Sin II and Baldur’s Gate 3, both of which earned huge sales and strong reputations. Beyond genre and mechanics, having Steam Trading Cards or supporting Remote Play on tablet might act as confounders: they signal the meticulousness of the game makers, which likely leads to a better game.